Business Understanding

Divvy, Chicago’s bike-sharing service, faces fluctuating demand for rides across days, weeks, and seasons. These fluctuations impact bike availability, station balancing, and operational efficiency. Accurately forecasting ride demand is essential for ensuring bikes and docks are available when and where riders need them. The business objective is to develop a time series forecasting model that predicts short-term and long-term ride demand, helping Divvy optimize resource allocation, reduce service disruptions, and improve customer satisfaction.

Problem statement

Divvy's ride demand varies by season, day of the week, and rider type. These fluctuations often lead to bike shortages at high-demand stations and surpluses at others, reducing customer satisfaction and increasing operational costs. To address this, Divvy requires accurate forecasts of daily ride demand: reliable demand predictions enable better resource allocation, efficient bike redistribution, and targeted marketing campaigns, supporting an improved rider experience and sustainable system operations.

Data Understanding

The dataset consists of Divvy trip data from January 2024 to August 2025, including ride start/end times, station information, user type (member vs. casual), and trip durations. Since rides are timestamped, the dataset supports the creation of aggregated time series (e.g., daily, weekly, or monthly ride counts). Additional contextual data such as weather conditions, holidays, and day-of-week effects can be integrated to better capture external influences on demand.

Data collection

Load data

Load and combine the monthly CSVs into a single data frame

# Libraries used throughout the analysis
library(tidyverse)   # readr, dplyr, purrr, tidyr, ggplot2
library(lubridate)   # date/time handling
library(tsibble)     # time series tibbles
library(fable)       # ARIMA / ETS models
library(skimr)       # structured data summaries
library(prophet)     # Prophet forecasting

# Define the path to the data directory
data_dir <- "resources/data/"

# Build a list of all CSV files in the directory
start_date <- as.Date("2024-01-01")
end_date <- Sys.Date()

# Generate month sequence
months_seq <- seq(from = floor_date(start_date, "month"),
                  to = floor_date(end_date, "month"),
                  by = "1 month")

# Expected filename format: "YYYYMM-divvy-tripdata.csv" (e.g., "202401-divvy-tripdata.csv")
expected_files <- paste0(format(months_seq, "%Y%m"), "-divvy-tripdata.csv")
file_paths <- file.path(data_dir, expected_files)

# Keep only files that exist
file_paths <- file_paths[file.exists(file_paths)]
if(length(file_paths) == 0) stop("No data files found in data_dir. Update path or filenames.")

# Read and bind all monthly files into one data frame
# (show_col_types = FALSE suppresses repeated type-guessing messages)
divvy <- file_paths %>%
  set_names() %>%
  map_df(~ readr::read_csv(.x, show_col_types = FALSE))

Quick inspection

Display the first six rows of the dataset

divvy %>%
  head() %>%
  as_tibble()
## # A tibble: 6 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 C1D650626C8C899A electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 EECD38BDB25BFCB0 electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 F4A9CE78061F17F7 electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 0A0D9E15EE50B171 classic_bike  2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 33FFC9805E3EFF9A classic_bike  2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 C96080812CD285C5 classic_bike  2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

Display the structure of the dataset

divvy %>% 
  glimpse()
## Rows: 9,555,602
## Columns: 13
## $ ride_id            <chr> "C1D650626C8C899A", "EECD38BDB25BFCB0", "F4A9CE7806…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2024-01-12 15:30:27, 2024-01-08 15:45:46, 2024-01-…
## $ ended_at           <dttm> 2024-01-12 15:37:59, 2024-01-08 15:52:59, 2024-01-…
## $ start_station_name <chr> "Wells St & Elm St", "Wells St & Elm St", "Wells St…
## $ start_station_id   <chr> "KA1504000135", "KA1504000135", "KA1504000135", "TA…
## $ end_station_name   <chr> "Kingsbury St & Kinzie St", "Kingsbury St & Kinzie …
## $ end_station_id     <chr> "KA1503000043", "KA1503000043", "KA1503000043", "13…
## $ start_lat          <dbl> 41.90327, 41.90294, 41.90295, 41.88430, 41.94880, 4…
## $ start_lng          <dbl> -87.63474, -87.63444, -87.63447, -87.63396, -87.675…
## $ end_lat            <dbl> 41.88918, 41.88918, 41.88918, 41.92182, 41.88918, 4…
## $ end_lng            <dbl> -87.63851, -87.63851, -87.63851, -87.64414, -87.638…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

Data summary

Basic statistic summary of each column in the dataset

summary(divvy)
##    ride_id          rideable_type        started_at                 
##  Length:9555602     Length:9555602     Min.   :2024-01-01 00:00:39  
##  Class :character   Class :character   1st Qu.:2024-06-30 12:57:54  
##  Mode  :character   Mode  :character   Median :2024-10-03 07:55:20  
##                                        Mean   :2024-11-20 03:52:51  
##                                        3rd Qu.:2025-05-23 11:31:38  
##                                        Max.   :2025-08-31 23:55:36  
##                                                                     
##     ended_at                   start_station_name start_station_id  
##  Min.   :2024-01-01 00:04:20   Length:9555602     Length:9555602    
##  1st Qu.:2024-06-30 13:22:50   Class :character   Class :character  
##  Median :2024-10-03 08:06:17   Mode  :character   Mode  :character  
##  Mean   :2024-11-20 04:09:51                                        
##  3rd Qu.:2025-05-23 11:49:11                                        
##  Max.   :2025-08-31 23:59:56                                        
##                                                                     
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:9555602     Length:9555602     Min.   :41.64   Min.   :-87.91  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng        member_casual     
##  Min.   :16.06   Min.   :-144.05   Length:9555602    
##  1st Qu.:41.88   1st Qu.: -87.66   Class :character  
##  Median :41.90   Median : -87.64   Mode  :character  
##  Mean   :41.90   Mean   : -87.65                     
##  3rd Qu.:41.93   3rd Qu.: -87.63                     
##  Max.   :87.96   Max.   : 152.53                     
##  NA's   :11083   NA's   :11083

Detailed and structured overview of the dataset

skim(divvy)
Data summary
Name divvy
Number of rows 9555602
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 9555391 0
rideable_type 0 1.00 12 16 0 3 0
start_station_name 1854178 0.81 9 64 0 1954 0
start_station_id 1854178 0.81 3 35 0 3431 0
end_station_name 1918068 0.80 9 64 0 1956 0
end_station_id 1918068 0.80 3 35 0 3436 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_lat 0 1 41.90 0.04 41.64 41.88 41.90 41.93 42.07 ▁▁▇▇▁
start_lng 0 1 -87.65 0.03 -87.91 -87.66 -87.64 -87.63 -87.52 ▁▁▁▇▁
end_lat 11083 1 41.90 0.05 16.06 41.88 41.90 41.93 87.96 ▁▇▁▁▁
end_lng 11083 1 -87.65 0.09 -144.05 -87.66 -87.64 -87.63 152.53 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2024-01-01 00:00:39 2025-08-31 23:55:36 2024-10-03 07:55:20 9343924
ended_at 0 1 2024-01-01 00:04:20 2025-08-31 23:59:56 2024-10-03 08:06:17 9346001

The summaries above surface several data quality considerations. Station names and IDs are missing for roughly 19–20% of trips (about 1.85–1.92 million records), and 11,083 rides lack end coordinates entirely. The coordinate ranges also expose data errors: end latitudes run from 16.06 to 87.96 and end longitudes from -144.05 to 152.53, far outside the Chicago area. ride_id is nearly but not perfectly unique (9,555,391 distinct values across 9,555,602 rows), so a small number of duplicates should be removed. Rider types are imbalanced, with members outnumbering casual riders, and rides fall into just three rideable_type categories. Together, these findings motivate the preprocessing steps that follow: de-duplication, filtering implausible ride durations, and handling missing station and coordinate data before building reliable daily time series.

Data description

The Divvy-tripdata dataset documents over 9.5 million bike-sharing trips in Chicago between January 2024 and August 2025. Each record represents a single ride and includes 13 variables describing trip timing, locations, bike type, and rider category. Trips are identified by unique IDs, with timestamps marking start and end times, and spatial details provided through both station identifiers and latitude/longitude coordinates. Riders are classified as either members or casual users, enabling comparisons across customer groups. The dataset spans the Chicago metropolitan area, though some records contain missing end-location values. Overall, it offers a comprehensive resource for analyzing temporal trends, spatial patterns, and behavioral differences in bike usage, making it highly suitable for forecasting and urban mobility studies.

Data dictionary

Column Name Description Data Type
ride_id Unique identifier for each ride Character
rideable_type Type of bike (e.g., classic, electric) Character
started_at Timestamp when the ride started POSIXct
ended_at Timestamp when the ride ended POSIXct
start_station_id Unique identifier for the start station Character
start_station_name Name of the station where the ride started Character
end_station_id Unique identifier for the end station Character
end_station_name Name of the station where the ride ended Character
start_lat Latitude of the start station Double
start_lng Longitude of the start station Double
end_lat Latitude of the end station Double
end_lng Longitude of the end station Double
member_casual Type of user (member or casual) Character


Data Preparation

Remove duplicates

Check for and remove duplicate ride_id entries

divvy <- divvy %>%
  distinct(ride_id, .keep_all = TRUE)

Data cleaning

Data cleaning and feature engineering

# Timestamps were already parsed as POSIXct by read_csv, so no re-parsing
# is needed. Compute ride length (minutes) and calendar features.
divvy <- divvy %>%
  mutate(
    ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),
    ride_date = as_date(started_at),
    dow  = wday(started_at, label = TRUE, week_start = 1),
    hour = hour(started_at)
  ) %>%
  # filter out obviously invalid durations (non-positive or 24+ hours)
  filter(!is.na(ride_date), ride_length_min > 0, ride_length_min < 24 * 60)

# Check member_casual values, then standardize labels and convert to factor
table(divvy$member_casual, useNA = "ifany")
## 
##  casual  member 
## 3524569 6018630
divvy <- divvy %>%
  mutate(member_casual = case_when(
    tolower(member_casual) %in% c("member","subscriber","member/guest") ~ "member",
    tolower(member_casual) %in% c("casual","customer","customer?") ~ "casual",
    TRUE ~ as.character(member_casual)
  )) %>%
  mutate(member_casual = as.factor(member_casual))

Aggregates

Aggregate daily counts & durations for each user type

daily_by_type <- divvy %>%
  group_by(ride_date, member_casual) %>%
  summarise(
    total_rides = n(),
    avg_duration = mean(ride_length_min, na.rm = TRUE),
    med_duration = median(ride_length_min, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(ride_date)

# Make sure every date-member combination exists (fill zeros)
all_dates <- tibble(ride_date = seq(min(daily_by_type$ride_date), 
                                    max(daily_by_type$ride_date), 
                                    by = "day"))
member_levels <- unique(daily_by_type$member_casual)

daily_by_type <- expand_grid(all_dates, 
                             member_casual = member_levels) %>%
  left_join(daily_by_type, by = c("ride_date", "member_casual")) %>%
  mutate(
    total_rides = replace_na(total_rides, 0),
    avg_duration = replace_na(avg_duration, 0),
    med_duration = replace_na(med_duration, 0)
  ) %>%
  arrange(member_casual, ride_date)

head(daily_by_type)
## # A tibble: 6 × 5
##   ride_date  member_casual total_rides avg_duration med_duration
##   <date>     <fct>               <int>        <dbl>        <dbl>
## 1 2024-01-01 casual               1165         20.7         9.83
## 2 2024-01-02 casual               1153         14.1         7.4 
## 3 2024-01-03 casual               1332         12.7         7.51
## 4 2024-01-04 casual               1504         14.2         7.23
## 5 2024-01-05 casual               1520         12.5         7.74
## 6 2024-01-06 casual                705         12.6         8.18

Visual exploration

Create time series object

# Convert to tsibble (index = ride_date, key = member_casual)
daily_ts <- daily_by_type %>%
  as_tsibble(index = ride_date, key = member_casual)
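Because every date–rider-type combination was zero-filled above, the tsibble should be regular with no implicit gaps. A quick sanity check with tsibble's has_gaps() before modeling:

```r
# Should report .gaps = FALSE for both the member and casual series
daily_ts %>% has_gaps()
```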

Plot time series for daily total rides by type

# Plot time series for total rides by type
daily_ts %>%
  ggplot(aes(x = ride_date, y = total_rides, color = member_casual)) +
  geom_line() +
  labs(title = "Daily total rides — member vs casual", 
       x = "Date", 
       y = "Total rides")

Plot time series for daily average duration by type

# Plot time series for daily average duration by type
daily_ts %>%
  ggplot(aes(x = ride_date, y = avg_duration, color = member_casual)) +
  geom_line() +
  labs(title = "Daily avg duration (min) — member vs casual", 
       x = "Date", 
       y = "Avg duration (min)")


Modeling

Test-train split

We use the last 30 days as the holdout. We’ll compute MAE, RMSE, and MAPE.

h <- 30  # holdout days

max_date <- max(daily_ts$ride_date)
train_max_date <- max_date - days(h)

train_ts <- daily_ts %>% filter(ride_date <= train_max_date)
test_ts  <- daily_ts %>% filter(ride_date > train_max_date)

# helper for metrics
compute_metrics <- function(actual, forecast) {
  tibble(
    MAE  = mean(abs(actual - forecast), na.rm = TRUE),
    RMSE = sqrt(mean((actual - forecast)^2, na.rm = TRUE)),
    # pmax(1, actual) floors the denominator at 1 so zero-ride days
    # don't produce division-by-zero in MAPE
    MAPE = mean(abs((actual - forecast) / pmax(1, actual)), na.rm = TRUE) * 100
  )
}
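As a quick sanity check of the helper, a toy example with forecasts off by 10 rides in each direction (expected values computed by hand):

```r
# MAE  = mean(10, 10) = 10
# RMSE = sqrt(mean(100, 100)) = 10
# MAPE = mean(10/100, 10/200) * 100 = 7.5
compute_metrics(actual = c(100, 200), forecast = c(110, 190))
```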

ARIMA & ETS models

Fit ARIMA & ETS using fable (per member_casual)

# Fit models on training data
fits <- train_ts %>%
  model(
    ARIMA = ARIMA(total_rides),
    ETS   = ETS(total_rides)
  )

# Summarize fits; glance() gives one row per series/model
# (report() is designed for a single model)
glance(fits)
## # A tibble: 2 × 10
##   member_casual .model      sigma2 log_lik    AIC   AICc    BIC      MSE    AMSE
##   <fct>         <chr>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl>
## 1 casual        ETS          0.142  -6026. 12072. 12072. 12116. 3140838.  3.63e6
## 2 member        ETS    3338014.     -6186. 12391. 12392. 12435. 3286128.  3.78e6
## # ℹ 1 more variable: MAE <dbl>

Forecast ARIMA & ETS for 30 days and evaluate

fc_fable <- fits %>%
  forecast(h = h)

# Convert forecasts to a tibble and join with test to compute metrics
fc_tbl <- fc_fable %>%
  as_tibble() %>%
  select(ride_date, member_casual, .model, .mean)

# compare to test
results_fable <- fc_tbl %>%
  left_join(test_ts %>% select(ride_date, member_casual, actual = total_rides),
            by = c("ride_date", "member_casual")) %>%
  group_by(member_casual, .model) %>%
  summarise(compute_metrics(actual, .mean), .groups = "drop")

results_fable
## # A tibble: 4 × 5
##   member_casual .model   MAE  RMSE   MAPE
##   <fct>         <chr>  <dbl> <dbl>  <dbl>
## 1 casual        ARIMA   NaN   NaN  NaN   
## 2 casual        ETS    1539. 2066.  18.8 
## 3 member        ARIMA   NaN   NaN  NaN   
## 4 member        ETS    1002. 1424.   7.55
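The NaN metrics for ARIMA (and its absence from the fit summaries above) indicate that ARIMA() returned null models: fable fails silently when a model cannot be estimated, leaving empty forecasts. Printing the mable shows which series/model combinations failed:

```r
# Failed fits appear as <NULL model> in the mable
fits
```

Constraining the automatic model search (for example, to a weekly seasonal period) is one remedy worth exploring; as fitted here, only ETS produced usable forecasts.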

Prophet model

Function to fit Prophet for one member type

fit_prophet_for_type <- function(df_train, h) {
  # df_train: tibble with ride_date and total_rides for one member type
  prophet_df <- df_train %>% select(ds = ride_date, y = total_rides)
  m <- prophet::prophet(prophet_df, daily.seasonality = FALSE,
                        weekly.seasonality = TRUE,
                        yearly.seasonality = TRUE, seasonality.mode = "additive")
  # future dataframe
  future <- make_future_dataframe(m, periods = h)
  forecast <- predict(m, future)
  # return model & forecast
  list(model = m, forecast = forecast)
}

Fit for each member type

member_types <- unique(train_ts$member_casual)
prophet_results <- list()

for (mt in member_types) {
  df_train_mt <- train_ts %>% filter(member_casual == mt) %>% as_tibble()
  prophet_results[[as.character(mt)]] <- fit_prophet_for_type(df_train_mt, h)
}

Extract forecasts into a tibble

prophet_fc_tbl <- map2_dfr(names(prophet_results), prophet_results, ~ {
  fc <- .y$forecast
  tibble(
    member_casual = .x,
    ride_date = as_date(fc$ds),
    prophet_mean = fc$yhat
  )
})

Join with test set and compute metrics

prophet_eval <- prophet_fc_tbl %>%
  filter(ride_date > train_max_date) %>%
  left_join(test_ts %>% select(ride_date, member_casual, actual = total_rides),
            by = c("ride_date","member_casual")) %>%
  group_by(member_casual) %>%
  summarise(compute_metrics(actual, prophet_mean), .groups = "drop")

prophet_eval
## # A tibble: 2 × 4
##   member_casual   MAE  RMSE  MAPE
##   <chr>         <dbl> <dbl> <dbl>
## 1 casual        1959. 2557.  18.8
## 2 member        1818. 2019.  12.4


Evaluation

Models will be evaluated on accuracy metrics including RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and MAPE (Mean Absolute Percentage Error). Forecast interpretability (e.g., identifying seasonal effects, day-of-week patterns) will also be considered to ensure insights are actionable for Divvy’s operations and marketing teams.
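To make the comparison concrete, the holdout metrics from both approaches can be combined into a single table (a sketch assuming the results_fable and prophet_eval objects from the Modeling section):

```r
# Side-by-side holdout comparison of ETS/ARIMA (fable) and Prophet
model_comparison <- bind_rows(
  results_fable %>% mutate(member_casual = as.character(member_casual)),
  prophet_eval  %>% mutate(.model = "Prophet")
) %>%
  arrange(member_casual, MAE)

model_comparison
```

On the 30-day holdout above, ETS outperforms Prophet for both rider types on MAE and RMSE.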


Deployment

The final forecasting model will be designed to generate regular demand forecasts (daily or weekly). Forecast outputs can be integrated into Divvy’s decision-making process for:

  • Station rebalancing and bike redistribution planning
  • Staff scheduling and resource allocation
  • Targeted promotions/marketing campaigns (e.g., weekends, holidays)
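As a minimal deployment sketch, the champion model can be refit on the full series and rolled forward 30 days whenever new monthly data arrives (assuming ETS, which had the lowest holdout error above):

```r
# Refit on all available data and forecast the next 30 days
final_fc <- daily_ts %>%
  model(ETS = ETS(total_rides)) %>%
  forecast(h = 30)

# Visualize forecasts with prediction intervals against history
final_fc %>% autoplot(daily_ts)
```

In production this chunk would be scheduled (e.g., after each monthly data drop), with the resulting forecasts exported to the rebalancing and staffing workflows listed above.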